```python
from config import *
%matplotlib inline

cmap = plt.get_cmap('viridis')
```
# User Cohorts Analysis
Understanding user behavior over time is crucial for any business, especially in fast-growing startups where data-driven decisions can define success. One of the most effective ways to analyze retention, engagement, and growth is through cohort analysis.
A user cohort table segments users based on shared characteristics—such as sign-up month or first purchase—and tracks their activity across time. This approach helps businesses measure customer retention, assess the impact of product changes, and optimize marketing strategies.
In this notebook, we’ll walk through the process of analyzing a cohort table step by step. Using Python, we’ll transform raw data into actionable insights, visualizing retention curves and uncovering trends that can inform strategic decisions. Whether you’re working in SaaS, e-commerce, or mobile apps, this guide will equip you with the tools to extract meaningful patterns from user data.
Let’s dive in.
## Right-aligned table
In the following table:

* every row corresponds to a single cohort of users, in this case those who made their first purchase in the given month
* every column indicates the number of users still active in the given month
For example, the number of users who made their first purchase in March and were still engaging with the platform in June is 9,468.
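The analysis below starts from a precomputed `user_cohorts.csv`. For context, a cohort table of this shape could be derived from a raw transaction log along these lines (a minimal sketch; the file name and the `user_id`/`date` columns are assumptions for illustration, not the actual preprocessing):

```python
import pandas as pd

# Hypothetical raw data: one row per purchase event
# ('transactions.csv' and its columns are assumed for illustration)
tx = pd.read_csv('transactions.csv', parse_dates=['date'])
tx['month'] = tx['date'].dt.to_period('M')

# A user's cohort is the month of their first purchase
tx['cohort'] = tx.groupby('user_id')['month'].transform('min')

# Count distinct active users per (cohort, calendar month)
cohort_table = (
    tx.groupby(['cohort', 'month'])['user_id']
      .nunique()
      .unstack('month')
)
```

This yields one row per cohort and one column per calendar month, i.e. the right-aligned layout used below.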
```python
# Load the cohort table and index it by cohort month
nc_cohorts_right = pd.read_csv('user_cohorts.csv')
nc_cohorts_right.rename(columns={'Num Users': 'month'}, inplace=True)
ind = nc_cohorts_right['month']  # keep the cohort labels for later reuse
nc_cohorts_right.set_index('month', inplace=True)

nc_cohorts_right_filled = nc_cohorts_right.fillna(0)
```
```python
styled_df = nc_cohorts_right.style.background_gradient(cmap=cmap, axis=1, vmax=80000, vmin=1000).format('{:,.0f}')

# Hide NaN cells by rendering them white on white
styled_df = styled_df.apply(
    lambda row: ['background-color: white; color: white' if pd.isna(v) else '' for v in row]
)
styled_df
```
| month | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jan | 689 | 414 | 379 | 342 | 347 | 362 | 344 | 349 | 347 | 339 | 342 | 337 |
| Feb | | 5,595 | 3,039 | 2,633 | 2,565 | 2,408 | 2,366 | 2,318 | 2,109 | 2,166 | 2,051 | 2,026 |
| Mar | | | 20,041 | 11,037 | 9,872 | 9,468 | 8,807 | 8,682 | 8,025 | 7,788 | 7,504 | 7,057 |
| Apr | | | | 30,278 | 15,831 | 14,449 | 13,351 | 12,587 | 11,856 | 11,180 | 10,988 | 9,999 |
| May | | | | | 35,630 | 16,924 | 14,768 | 13,840 | 12,966 | 12,260 | 12,118 | 11,082 |
| Jun | | | | | | 47,489 | 21,653 | 18,921 | 17,141 | 15,806 | 15,654 | 14,102 |
| Jul | | | | | | | 78,425 | 34,774 | 29,399 | 26,073 | 25,229 | 23,016 |
| Aug | | | | | | | | 109,117 | 43,172 | 35,336 | 33,469 | 30,535 |
| Sep | | | | | | | | | 116,187 | 40,402 | 35,154 | 31,603 |
| Oct | | | | | | | | | | 117,185 | 34,462 | 29,055 |
| Nov | | | | | | | | | | | 107,597 | 32,920 |
| Dec | | | | | | | | | | | | 97,164 |
```python
retained_clients = nc_cohorts_right.Dec.sum()
print(f"Total retained clients: {int(retained_clients)}")
```

```
Total retained clients: 288895
```
We can visualize the table more intuitively with a simple chart.
```python
colors = cmap(np.linspace(0, 1, nc_cohorts_right.shape[1]))

fig, ax = plt.subplots(figsize=(8, 5))
ax.stackplot(nc_cohorts_right_filled.index, nc_cohorts_right_filled, labels=nc_cohorts_right_filled.columns, colors=colors, alpha=0.8)
ax.legend(loc='upper left', title='Cohort')
ax.set_title('Users Per Month', loc='left')
ax.set_xlabel('Month')
ax.set_ylabel('Number of Users')
ax.grid(False)
plt.show()
```
In this case, I noticed that growth is slowing down. Remember, you always want to confirm your visual intuition with data: here, I approximated the growth curve with a known (continuous) function. Since the function approximates the curve well, I can use its first derivative to draw conclusions about the instantaneous growth rate.
```python
trend = nc_cohorts_right.sum()  # total active users per calendar month
t = np.arange(1, len(trend) + 1)

def logistic_function(t, L, k, t0):
    return L / (1 + np.exp(-k * (t - t0)))

[L, k, t0], _ = curve_fit(logistic_function, t, trend, p0=[trend.max(), 1, 0])

approx = logistic_function(t, L, k, t0)
intensity = k * L * np.exp(-k * (t - t0)) / (1 + np.exp(-k * (t - t0)))**2  # analytic first derivative

R2 = 1 - np.sum((trend - approx) ** 2) / np.sum((trend - np.mean(trend)) ** 2)
```
```python
fig, ax = plt.subplots(2, 1, figsize=(8, 6))
ax[0].plot(trend, label="Trend")
ax[0].plot(approx, label="Logistic Fit")
ax[0].set_title(f'Approximation of Growth with Logistic Function: R² = {round(R2, 3)}')
ax[0].legend()

ax[1].plot(intensity, color="red", label="Derivative")
ax[1].set_title("Derivative of Logistic Function (Growth Intensity)")
ax[1].legend()

plt.tight_layout()
```
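As a quick sanity check (my addition, not part of the original analysis), the closed-form derivative can be compared against a finite-difference derivative of the fitted curve; the two should agree up to discretization error on the monthly grid:

```python
# Numerical derivative of the fitted logistic curve; it should closely
# track the analytic `intensity` computed above
numeric_intensity = np.gradient(logistic_function(t, L, k, t0), t)
```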
## Left-aligned table
In the following table:

* every row corresponds to a single cohort of users, in this case those who made their first purchase in the given month
* every column indicates the number of users still active after n months, where n is the column header; n=0 corresponds to the month in which the users made their first purchase
This table is very useful, as it allows us to analyze the behaviour of different cohorts over their lifetime.
```python
# Shift each cohort row left so that column 0 is the first-purchase month
nc = nc_cohorts_right.copy().to_numpy()
for i, row in enumerate(nc):
    a = np.argwhere(np.isnan(row))
    if a.size != 0:
        start = a.max() + 1  # index of the first non-NaN entry
        temp = row[start:]
        if temp.size < 12:
            # pad the tail with NaN so every row keeps 12 columns
            temp = np.append(temp, np.full(shape=(12 - temp.size), fill_value=np.nan))
        nc[i] = temp
    else:
        pass

nc_cohorts_left = pd.DataFrame(nc, index=ind)
```
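As an aside, the same left shift can be expressed with pandas alone; an equivalent sketch (not the author's code):

```python
# Drop each row's leading NaNs and re-number the surviving months 0..k;
# rows shorter than 12 months are padded back with trailing NaNs automatically
nc_cohorts_left_alt = nc_cohorts_right.apply(
    lambda row: pd.Series(row.dropna().to_numpy()), axis=1
)
```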
```python
styled_df = nc_cohorts_left.style.background_gradient(cmap=cmap, axis=1, vmax=80000, vmin=1000).format('{:,.0f}')

styled_df = styled_df.apply(
    lambda row: ['background-color: white; color: white' if pd.isna(v) else '' for v in row]
)
styled_df
```
| month | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jan | 689 | 414 | 379 | 342 | 347 | 362 | 344 | 349 | 347 | 339 | 342 | 337 |
| Feb | 5,595 | 3,039 | 2,633 | 2,565 | 2,408 | 2,366 | 2,318 | 2,109 | 2,166 | 2,051 | 2,026 | |
| Mar | 20,041 | 11,037 | 9,872 | 9,468 | 8,807 | 8,682 | 8,025 | 7,788 | 7,504 | 7,057 | | |
| Apr | 30,278 | 15,831 | 14,449 | 13,351 | 12,587 | 11,856 | 11,180 | 10,988 | 9,999 | | | |
| May | 35,630 | 16,924 | 14,768 | 13,840 | 12,966 | 12,260 | 12,118 | 11,082 | | | | |
| Jun | 47,489 | 21,653 | 18,921 | 17,141 | 15,806 | 15,654 | 14,102 | | | | | |
| Jul | 78,425 | 34,774 | 29,399 | 26,073 | 25,229 | 23,016 | | | | | | |
| Aug | 109,117 | 43,172 | 35,336 | 33,469 | 30,535 | | | | | | | |
| Sep | 116,187 | 40,402 | 35,154 | 31,603 | | | | | | | | |
| Oct | 117,185 | 34,462 | 29,055 | | | | | | | | | |
| Nov | 107,597 | 32,920 | | | | | | | | | | |
| Dec | 97,164 | | | | | | | | | | | |
print(f"Total initial users: {int(nc_cohorts_left[0].sum())}")
Total initial users: 765395
### Assessing retention rates
With a left-aligned table, we can analyse retention rates for every cohort. In this case, retention rates are calculated relative to the cohort's initial size: \[ \frac{Active\, users\, at\, month\, n}{Active\, users\, at\, month\, 0} \]
```python
retention = nc_cohorts_left.copy()

for col in retention.columns:
    if col == 0:
        retention[col] = 1
    else:
        retention[col] = nc_cohorts_left[col]/nc_cohorts_left[0]

styled_df = retention.style.background_gradient(cmap=cmap, axis=1, vmax=.60, vmin=.30).format('{:.2%}')

styled_df = styled_df.apply(
    lambda row: ['background-color: white; color: white' if pd.isna(v) else '' for v in row]
)
styled_df
```
| month | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jan | 100.00% | 60.14% | 55.07% | 49.64% | 50.36% | 52.54% | 50.00% | 50.72% | 50.36% | 49.28% | 49.64% | 48.91% |
| Feb | 100.00% | 54.33% | 47.06% | 45.85% | 43.04% | 42.28% | 41.44% | 37.69% | 38.72% | 36.66% | 36.22% | |
| Mar | 100.00% | 55.07% | 49.26% | 47.24% | 43.94% | 43.32% | 40.04% | 38.86% | 37.44% | 35.21% | | |
| Apr | 100.00% | 52.29% | 47.72% | 44.09% | 41.57% | 39.16% | 36.92% | 36.29% | 33.03% | | | |
| May | 100.00% | 47.50% | 41.45% | 38.84% | 36.39% | 34.41% | 34.01% | 31.10% | | | | |
| Jun | 100.00% | 45.60% | 39.84% | 36.10% | 33.28% | 32.96% | 29.70% | | | | | |
| Jul | 100.00% | 44.34% | 37.49% | 33.25% | 32.17% | 29.35% | | | | | | |
| Aug | 100.00% | 39.56% | 32.38% | 30.67% | 27.98% | | | | | | | |
| Sep | 100.00% | 34.77% | 30.26% | 27.20% | | | | | | | | |
| Oct | 100.00% | 29.41% | 24.79% | | | | | | | | | |
| Nov | 100.00% | 30.60% | | | | | | | | | | |
| Dec | 100.00% | | | | | | | | | | | |
We can then easily visualize the retention trend across the different cohorts with a line chart. Ideally, retention rates should improve over time, although that is not the case here.
```python
fig, ax = plt.subplots(1, 1)
retention.T.plot(marker='s', ax=ax, color=colors, alpha=0.67)

ax.set_title('Hanging Ribbons', loc='left')
ax.set_ylabel('% Of Cohort Retained')
ax.set_xlabel('Time')

plt.gca().yaxis.set_major_formatter(PercentFormatter(xmax=1))

ax.legend(
    title='Cohort',
    loc='upper center',
    ncol=5,
    frameon=True
)
plt.tight_layout()
plt.show()
```
We can easily check that retention is negatively correlated with time (and with the size of the cohorts).
```python
from sklearn.preprocessing import minmax_scale

scaled_nc_cohorts_left = minmax_scale(range(1, len(nc_cohorts_left.index)+1), axis=0)
scaled_retention = minmax_scale(retention.iloc[:, 1:4], axis=0)

df_scaled = pd.DataFrame(
    scaled_nc_cohorts_left, columns=["nc_cohorts_left"]
).join(pd.DataFrame(scaled_retention, columns=["retention_1", "retention_2", "retention_3"]))

corr_matrix = df_scaled.corr()

sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", cbar=False)
plt.title("Correlation Matrix")
plt.tight_layout()
```
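The heatmap correlates retention with cohort order, i.e. with time. To also check the parenthetical claim about cohort size directly (my addition, using only quantities already computed above):

```python
# Pearson correlation between month-1 retention and initial cohort size
size_corr = retention[1].corr(nc_cohorts_left[0])
print(f"Corr(retention at month 1, cohort size): {size_corr:.2f}")
```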
### Assessing churn rates
With a left-aligned table, we can also analyse churn rates for every cohort. Churn rates in the following table are calculated as: \[ \frac{Churned\, users\, at\, month\, n}{Active\, users\, at\, month\, n-1} \]
```python
churn = nc_cohorts_left.copy()

for col in churn.columns:
    if col == 0:
        churn[col] = 0
    else:
        churn[col] = (nc_cohorts_left[col-1]-nc_cohorts_left[col])/nc_cohorts_left[col-1]

styled_df = churn.style.background_gradient(cmap=cmap, axis=1, vmax=.2, vmin=0).format('{:.2%}')

styled_df = styled_df.apply(
    lambda row: ['background-color: white; color: white' if pd.isna(v) else '' for v in row]
)
styled_df
```
| month | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jan | 0.00% | 39.86% | 8.43% | 9.87% | -1.46% | -4.32% | 4.83% | -1.45% | 0.71% | 2.16% | -0.74% | 1.46% |
| Feb | 0.00% | 45.67% | 13.38% | 2.56% | 6.13% | 1.76% | 2.00% | 9.04% | -2.72% | 5.30% | 1.22% | |
| Mar | 0.00% | 44.93% | 10.56% | 4.10% | 6.98% | 1.42% | 7.56% | 2.95% | 3.65% | 5.95% | | |
| Apr | 0.00% | 47.71% | 8.73% | 7.60% | 5.72% | 5.81% | 5.70% | 1.72% | 8.99% | | | |
| May | 0.00% | 52.50% | 12.74% | 6.29% | 6.31% | 5.45% | 1.16% | 8.55% | | | | |
| Jun | 0.00% | 54.40% | 12.62% | 9.40% | 7.79% | 0.96% | 9.92% | | | | | |
| Jul | 0.00% | 55.66% | 15.46% | 11.31% | 3.24% | 8.77% | | | | | | |
| Aug | 0.00% | 60.44% | 18.15% | 5.28% | 8.77% | | | | | | | |
| Sep | 0.00% | 65.23% | 12.99% | 10.10% | | | | | | | | |
| Oct | 0.00% | 70.59% | 15.69% | | | | | | | | | |
| Nov | 0.00% | 69.40% | | | | | | | | | | |
| Dec | 0.00% | | | | | | | | | | | |
It’s interesting to calculate and plot the weighted average churn rate for every month. In this case, most of the users disengage after the first month: they were attracted by offers and promotions, and left the platform as soon as they had used those incentives.
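Concretely, "weighted average" here means pooling users across cohorts before taking the ratio, which weights each cohort's churn rate by its size (this restates what the code below computes): \[ \frac{\sum_{cohorts} Churned\, users\, at\, month\, n}{\sum_{cohorts} Active\, users\, at\, month\, n-1} \]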
```python
churned = nc_cohorts_left.copy()
retained = nc_cohorts_left.copy()

for col in churned.columns:
    if col == 0:
        churned[col] = 0
    else:
        churned[col] = (nc_cohorts_left[col-1]-nc_cohorts_left[col])

avg_churn = np.zeros(11)

# Exclude the last cohort (December), which only has its initial month
retained_sum = retained.iloc[:-1,:].sum()
churned_sum = churned.iloc[:-1,1:].sum()

for i in range(len(avg_churn)):
    avg_churn[i] = churned_sum[i+1] / retained_sum[(i)]

avg_churn
```
```
array([0.61894928, 0.12467046, 0.0687431 , 0.05115784, 0.03639244,
       0.04167227, 0.03471718, 0.0376834 , 0.02842538, 0.00237718,
       0.00210748])
```
```python
std = churn.iloc[:-1,1:].std()

fig, ax = plt.subplots(1, 1, figsize=(8,5))
ax.plot(range(1, 12, 1), avg_churn, label='Mean', color='#00A082')
plt.fill_between(range(1, 12, 1), avg_churn - std, avg_churn + std, alpha=0.2, label='±1 Std Dev', color='#00A082')

plt.xlabel('Month')
plt.ylabel('% of churned customers (relative to prev. month)')
plt.title('Churned customers in lifetime month', loc='left')
plt.gca().yaxis.set_major_formatter(PercentFormatter(xmax=1))
plt.legend()
plt.tight_layout()
```
I then wanted to see how many additional clients could have been retained at the end of the year by reducing the first-month churn rate by 5 percentage points.
```python
avg_churn_proj = avg_churn.copy()
avg_churn_proj[0] = avg_churn_proj[0]-.05  # cut first-month churn by 5 percentage points

starting_users = nc_cohorts_left[0].values
projection = np.zeros([12,12])
projection[:,0] = starting_users
# Roll each cohort forward through the projected monthly churn rates
for i in range(11):
    projection[:(11-i),i+1] = projection[:(11-i),i] * (1-avg_churn_proj[i])

# Sum the anti-diagonal: each cohort's projected users in December
total_clients = 0
for i in range(len(projection)):
    total_clients += projection[i, 11-i]

print(f"Additional clients: {int(total_clients - retained_clients)}")
```

```
Additional clients: 42441
```
To gauge how long clients who get past the first month stay engaged with the service, it is useful to calculate the (truncated) expected lifespan of the users. This is simply calculated as: \[ \frac{1}{avg\, churn\, rate} \]
```python
# Expected lifespan (in months) for users who survive month 1:
# reciprocal of their pooled average monthly churn rate
1/((nc_cohorts_left.sum()[1:-1]*avg_churn[1:]).sum()/nc_cohorts_left.sum()[1:-1].sum())
```

```
14.111089070031147
```
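For intuition on why lifespan is the reciprocal of the churn rate (my addition): with a constant monthly churn probability \(p\), the number of months a user stays is geometrically distributed, with mean \[ E[lifetime] = \sum_{n=1}^{\infty} n\, p\, (1-p)^{n-1} = \frac{1}{p} \] which, applied to the pooled churn rate of users who survive month 1, gives the roughly 14-month figure above.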